Method and system for scoring credibility of information sources

ABSTRACT

A method for classifying information sources and content based on credibility, reliability, or trust. A content item describing an event is retrieved from an information provider and parsed for links. Each link is evaluated and attributed a sentiment score. The same event is identified in a set of known sources and an event score is calculated based on the credibility of each of the known sources. Finally, the content item is ranked based on the event and sentiment scores.

FIELD OF THE INVENTION

The present invention relates generally to information extraction. More particularly, the present invention relates to classifying or ranking information sources and events during extraction.

BACKGROUND OF THE INVENTION

The internet is one of the primary sources of information of modern life.

However, on the web, a great deal of valuable, useful and accurate information coexists with misleading or inaccurate information. There also exist sources of information that are more trusted and those that are less trusted, and other sources which cannot readily be identified as trusted or not trusted. General web-based searching can return information that is harmful or misleading. The use of non-credible sources of information as a basis for decisions can have a severe impact in fields like politics, health, finance and many others. For instance, in the 2008 U.S. presidential campaign of Barack Obama, misleading information connecting the future president to a Muslim faith organization resulted in substantial confusion among voters. Various other instances of false or misleading reports emanating from the internet have been documented, and have had consequences affecting lives and decisions. In more daily and personal applications, information obtained from the internet serves as a basis for decision making in insurance underwriting processes, credit and lending decisions, mergers and acquisitions, fraud detection, hiring decisions and many others. In this sense, credibility assessments are becoming increasingly important in order to build judgment skills to properly discern between different sources of information, and to address contradictions in information from various sources.

Prior art approaches to this problem have attempted to reduce web spam by developing credibility-based link analysis algorithms like the ones used in common search engines. Common examples include the PageRank algorithm developed and used by Google™, the TrustRank algorithm developed by Stanford University and Yahoo!™ and the HITS algorithm which was a precursor to the PageRank algorithm. Each of these prior art approaches relies on the assumption that the quality of a web page is correlated to the quality of its links, and returns, in response to a search query, a ranked list of web pages. Spammers have created several ways to take advantage of how search engines operate, like “hijacking” trusted web pages and building “honeypots”, or groups of legitimate-appearing web pages, to induce trusted pages to link to them. Recent studies (such as (i) D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics. WebDB, 2004 and (ii) Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating Web spam with TrustRank. VLDB, 2004.) suggest 26% of web content is spam. On top of this, there is some amount of inaccurate or mistrusted information that cannot be properly described as spam.

As is evident, prior art approaches have been suitable for ranking web pages and providing a list of hits in response to a search request, but are inadequate for assessing the reliability of the information, the reliability of the links to other sources on web pages, or the reliability of events being described with sufficient confidence to permit decision-makers to rely on this information without a significant due diligence burden.

SUMMARY OF THE INVENTION

In contrast to prior art approaches, the present invention does not attempt to determine if a source is spam, but rather attempts to assess the underlying credibility of sources and the probability that information from the underlying source, such as an event or a purported fact having occurred, is truthful or reliable. Events or facts may be derived from more than one source, and it is the events or facts themselves that are assessed for their reliability, rather than the web pages themselves.

According to one embodiment of the invention, there is provided a computer-implemented method for ranking information stored on a computer readable medium; the method includes extracting a content item describing an event from an information source; parsing by a parsing module the content item for a plurality of source links; attributing by a content analysis module a sentiment score to each source link; wherein the sentiment score is indicative of the relative credibility of each of the source links; scoring by a scoring module the information source based on the source links and on the sentiment score; and ranking the content item based on a score associated with the information source.

According to one aspect of this embodiment, the scoring comprises

calculating r from equation (1):

r=α*T+(1−α)*d  (1)

where d is a non-zero static score distribution vector, T is a transition matrix, and α is a predetermined constant; and, wherein each term in the transition matrix is modified by a non-zero sentiment score.

According to another aspect of this embodiment, the method further includes storing on a score database implemented on a computer readable medium the score for the information source.

According to another aspect of this embodiment, the method further includes prior to the parsing step determining whether the information source has an associated score in the score database, and upon determining that the information source has an associated score in the score database, retrieving the score and returning to the extracting step.

According to another aspect of this embodiment, the method further includes identifying an event from each source in a set of information sources; calculating an event score for the content item describing the event based on a credibility score for each of the known sources in the set of information sources; and combining the score for a respective information source with the credibility score to determine a cumulative event score.

According to another aspect of this embodiment, each of the sources in the set of information sources is classified as one of a known good source, a known bad source and an unknown reliability source, and wherein the calculating an event score is biased towards sources identified as known good sources.

According to another aspect of this embodiment, the event score is calculated as:

$EventScore = \begin{cases}
A & \text{if event contains a KG,} \\
\frac{A}{m} & \text{if event contains } b \text{ unknowns and no KG nor KB} \\
\frac{A}{n} & \text{if event has less than } b \text{ unknowns and no KG nor KB} \\
\frac{A}{p} & \text{if event has no KG and includes a KB}
\end{cases}$

where A, m, n and p are parameters selected such that A/p<A/n<A/m<A; KG is a known good source; KB is a known bad source.

According to another aspect of this embodiment, the identified events are compared to identify contradictions, and the calculating an event score includes biasing events from known good sources to resolve the contradictions.

According to another aspect of this embodiment, the combining comprises calculating an event ranking representative of the event being reliable.

According to another aspect of this embodiment, the event ranking is calculated as:

EventRank=a*EventScore+b*ΣLinkScore+c*LinksToEvent

where a, b and c are weighted coefficients and LinksToEvent is calculated as:

$LinksToEvent = \sum_{i=1}^{n} LinkScore(i) \times Sent$

where n is the number of sources and LinkScore is the score of the information source, and Sent is the sentiment score.
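As an illustration only, the EventRank combination can be sketched in Python. The function names, the example weights and the reading of Sent as a per-source value Sent(i) are assumptions made for this sketch; the formula above writes Sent without an index.

```python
from typing import Sequence


def links_to_event(link_scores: Sequence[float], sentiments: Sequence[float]) -> float:
    """LinksToEvent: sum over sources of LinkScore(i) * Sent(i).

    Assumption: one sentiment score per source, paired by position.
    """
    if len(link_scores) != len(sentiments):
        raise ValueError("need one sentiment score per link score")
    return sum(score * sent for score, sent in zip(link_scores, sentiments))


def event_rank(
    event_score: float,
    link_scores: Sequence[float],
    sentiments: Sequence[float],
    a: float = 0.5,   # hypothetical weight for EventScore
    b: float = 0.25,  # hypothetical weight for the summed LinkScore
    c: float = 0.25,  # hypothetical weight for LinksToEvent
) -> float:
    """EventRank = a*EventScore + b*sum(LinkScore) + c*LinksToEvent."""
    return (
        a * event_score
        + b * sum(link_scores)
        + c * links_to_event(link_scores, sentiments)
    )
```

For instance, `event_rank(10.0, [0.5, 0.25], [1.0, 0.0], a=1.0, b=2.0, c=4.0)` combines an event score of 10 with two source scores, only the first of which carries positive sentiment.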

According to another aspect of this embodiment, the method further includes calculating an accumulated event rank for the information provider from a plurality of event ranks by:

$AccumulatedEventRank = \frac{1}{N_{i}} \sum_{j=1}^{N_{i}} EventRank(j)$

where EventRank(j) is a plurality of event scores for a plurality of content items and N_(i) is a total number of content items of the information provider in the known source database.
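The accumulated event rank is the arithmetic mean of a provider's event ranks. A minimal Python sketch follows; the function name is hypothetical.

```python
from typing import Sequence


def accumulated_event_rank(event_ranks: Sequence[float]) -> float:
    """Mean of EventRank(j) over the N_i content items of one provider."""
    if not event_ranks:
        raise ValueError("provider has no ranked content items")
    return sum(event_ranks) / len(event_ranks)
```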

According to a second embodiment of the invention, there is provided a computer-implemented method for ranking information stored on a computer readable medium; the method including identifying an event from each source in a set of information sources; calculating an event score for a content item describing the event based on a credibility score for each of the known sources in the set of information sources; and combining a score for a respective information source with the credibility score to determine a cumulative event score.

According to one aspect of this second embodiment, each of the sources in the set of information sources is classified as one of a known good source, a known bad source and an unknown reliability source, and wherein the calculating an event score is biased towards sources identified as known good sources.

According to another aspect of this second embodiment, the event score is calculated as:

$EventScore = \begin{cases}
A & \text{if event contains a KG,} \\
\frac{A}{m} & \text{if event contains } b \text{ unknowns and no KG nor KB} \\
\frac{A}{n} & \text{if event has less than } b \text{ unknowns and no KG nor KB} \\
\frac{A}{p} & \text{if event has no KG and includes a KB}
\end{cases}$

where A, m, n and p are parameters selected such that A/p<A/n<A/m<A; KG is a known good source; KB is a known bad source.

According to another aspect of this second embodiment, the identified events are compared to identify contradictions, and the calculating an event score includes biasing events from known good sources to resolve the contradictions.

According to another aspect of this second embodiment, the combining comprises calculating an event ranking representative of the event being reliable.

According to another aspect of this second embodiment, the event ranking is calculated as:

EventRank=a*EventScore+b*ΣLinkScore+c*LinksToEvent

where a, b and c are weighted coefficients and LinksToEvent is calculated as:

$LinksToEvent = \sum_{i=1}^{n} LinkScore(i) \times Sent$

where n is the number of sources and LinkScore is the score of the information source, and Sent is the sentiment score.

According to another aspect of this second embodiment, the method further includes calculating an accumulated event rank for the information provider from a plurality of event ranks by:

$AccumulatedEventRank = \frac{1}{N_{i}} \sum_{j=1}^{N_{i}} EventRank(j)$

where EventRank(j) is a plurality of event scores for a plurality of content items and N_(i) is a total number of content items of the information provider in the known source database.

According to another aspect of this second embodiment, the score for a respective information source is determined by: extracting a content item describing an event from an information source; parsing by a parsing module the content item for a plurality of source links; attributing by a content analysis module a sentiment score to each source link; wherein the sentiment score is indicative of the relative credibility of each of the source links; scoring by a scoring module the information source based on the source links and on the sentiment score; and ranking the content item based on a score associated with the information source.

According to another aspect of this second embodiment, the scoring comprises

calculating r from:

r=α*T+(1−α)*d

where d is a non-zero static score distribution vector, T is a transition matrix, and α is a predetermined constant; and, wherein each term in the transition matrix is modified by a non-zero sentiment score.

According to another aspect of this second embodiment, the method further includes storing on a score database implemented on a computer readable medium the score for the information source.

According to another aspect of this second embodiment, the method further includes prior to the parsing step determining whether the information source has an associated score in the score database, and upon determining that the information source has an associated score in the score database, retrieving the score and returning to the extracting step.

According to another aspect of this second embodiment, the sentiment score is derived using a sentiment scorer that was created at least in part using:

a training set of known true and known false events; and

known links and the associated text of the known links to content that specifies the known true and known false events.

According to another aspect of this second embodiment, the ranking score is used for any one or more of insurance underwriting, assessing suspected fraudulent activity, credit decisioning, and securities trading.

According to other aspects of the invention, non-transitory computer readable media include computer executable instructions for carrying out the methods as herein described. In still other embodiments, computer systems for implementing the methods of the above-described embodiments are disclosed.

BRIEF DESCRIPTION OF THE DRAWINGS

An embodiment will now be described, by way of example only, with reference to the attached Figures, wherein:

FIG. 1 shows a high-level architecture of a system for acquiring content items and applying scoring and ranking to the content.

FIG. 2 shows a schematic of a computer system that may be used to implement various parts of the invention.

FIG. 3 shows a flow diagram of the method of scoring links associated with the content item.

FIG. 4 shows a flow diagram of the method of the event analyzer where the event classification and scoring is performed.

FIG. 5 shows an example of event classification and scoring.

DETAILED DESCRIPTION OF THE EMBODIMENT

As mentioned earlier, the present invention assesses the underlying credibility of sources and the probability that information from underlying sources, such as an event or a purported fact having occurred, is truthful or reliable. Events or facts may be derived from more than one source, and it is the events or facts themselves that are assessed for their reliability, rather than the web pages themselves. For the purposes of this description, the term “event” is used to describe a piece of information that is being subjected to the credibility assessment. An event as used herein may be any piece of information or purported fact, generally determined to be of significance to a request for information, such as a web search. The term event is being used, in part, because by definition the invention in its preferred embodiment is used to assess the reliability of a reported event having occurred, or facts identified as being relevant to a reported event. The invention does not, per se, relate to determining whether known facts are applicable to a user's query, for example, whether a particular scientific formula is relevant to solving a problem posed by a user's query.

Furthermore, the preferred embodiments are described with respect to online news sources, but the sources of information for assessing the credibility of a reported event are not limited to these. Other sources may equally be used as inputs to the invention for the credibility analysis, including but not limited to RSS feeds, discussion forums, social media such as Facebook™ or Twitter™, posts, emails, electronic journals, databases and/or web pages from a multitude of other sources. The invention may also be applied to information available on local networks that are not generally available to the public. In this manner, where the invention is being used by an institution for diligence purposes, such as fraud, insurance or personnel research, sources of information belonging to or accessible only by the institution can also be included in the search universe to generate a higher degree of confidence in the results.

FIG. 1 shows a network of computer systems 2 having an information provider 4 that provides information content via the Internet 6. Client devices such as a desktop computer 8, a tablet computer 10, or a mobile smartphone 12 request the information content using hypertext transfer protocol (HTTP) requests that are transmitted over a wired or wireless link to the Internet 6 and on to the server systems of the information provider 4. The information provider in turn supplies the requested article to the client device. The computing structure 14 can reside on the client device, a proxy server(s), or other trusted computer system(s) on the Internet 6 or a combination thereof.

FIG. 2 shows a computer system 2, which includes a number of physical and logical components, including a central processing unit (“CPU”) 24, random access memory (“RAM”) 28, an input/output (“I/O”) interface 32, a network interface 36, non-volatile storage 4, a display 40 and a local bus 44 enabling the CPU 24 to communicate with the other components. The CPU 24 executes an operating system, and a number of software systems and/or software modules. RAM 28 provides relatively-responsive volatile storage to the CPU 24. The I/O interface 32 allows for human-computer input to be received from one or more devices, such as a keyboard, a mouse, a touch screen, etc., and outputs information to output devices, such as a display and/or speakers. The network interface 36 (e.g. Ethernet, WiFi, Bluetooth, etc.) permits communication with elements in network communication, and provides access to the internet. A number of these computer systems may be networked together, host information from other sources, etc. Non-volatile storage 4 stores the operating system and programs. During operation of the computer system, the operating system, the programs and the data may be retrieved from the non-volatile storage 4 and placed in RAM 28 to facilitate execution. These computer systems are known in the art, and their communications with the internet and other networks are also known. It is within this infrastructure that the preferred embodiments of the invention operate.

Broadly, the invention provides for two complementary approaches for generating a rank or score, although it is worth noting that each of the approaches could also be used independently to arrive at partial or intermediate results that are also useful. First, a method is described that assesses the reliability of the source of information, particularly the reliability of links. Next, a method is described that assesses the reliability of the content of the information regarding the event itself. A method of combining these approaches completes the preferred embodiment and provides a two-pronged approach to assessing the reliability of derived information.

Assessing the Reliability of Sources

Turning now to FIG. 3, there is illustrated a method for ranking or scoring links obtained from a source of information, such as a web page. While prior art methods for assessing the quality of web pages by virtue of the links on those web pages do exist, the present invention provides this functionality in a more robust manner as will shortly be described. Prior art processes and algorithms use a random or biased web crawler to evaluate the rank of a page. After a certain number of iterations, the random crawler will locate the pages with a higher rank with a higher probability of being relevant. This approach includes the assumption that at a given web page the crawler randomly selects the links located at the web page, or selects pages related to given subjects in a biased manner. This assumption is to some extent contrived, since a real crawler will not act randomly when selecting links to follow but will select links based mainly on information accompanying the link. Some prior art web crawlers will mainly follow positive links and discard the negative ones. For example, if a link to a page said “this content is wrong” and another link said “here is the right answer”, the positive link would receive a positive bias. One implementation of this includes assigning a probability factor (score) associated with the sentiment related to a link: a value between 0 and 1, mapped from negative sentiment to positive sentiment. The mapping could be done from a discrete sentiment score base, for example with 3 levels (positive, negative, no sentiment), or it can be done from a continuous sentiment score base assigning sentiment scores to terms. A similar method was applied for blog distillation; the reference is “Blog Distillation via Sentiment-Sensitive Link Analysis”, Giacomo Berardi et al., Natural Language Processing and Information Systems, Lecture Notes in Computer Science Volume 7337, 2012, pp 228-233.

The preferred embodiment of the present invention includes a crawler that considers the sentiment (more specifically, the trust or credibility) relating to the link source to influence the crawling decisions. A probability function is determined that assigns different probability values to the links in the content item (typically a web page) according to the sentiments attached to the link sources.

For the purposes of this disclosure, a sentiment or sentiment score or sentiment ranking refers to the relative trust or credibility of links or references found at an information source in respect of an event. To illustrate this in simplest terms, the links on a particular source of information can be manually reviewed and identified as having a net “positive” or a net “negative” sentiment. Of course, the sentiment does not have to be a binary indicator, and it is preferable to have a plurality of degrees of sentiment. One method of establishing sentiment is described below. Alternate methods of determining sentiment are also contemplated, including algorithms, references to databases of known sentiment levels, etc.

As shown in FIG. 3, a content item in respect of an event or other piece of information is retrieved from the information provider via a source on the internet 6. Optionally, a determination is made at 302 whether the source of the retrieved content item has been previously scored. If the source has been previously scored, the sentiment score is retrieved from the saved score database 304. If the source of the content item has not been previously scored, a parsing engine 306 parses the content item for links to other sources (e.g. source links). Each of the source links found is extracted by an extraction module 308, optionally along with the associated information of the source link such as the sentence in which the source link is found, the link descriptor, etc. The associated source itself or the information contained therein may then be analyzed by a content analysis module 310 to determine the sentiment or credibility associated with the source link.
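The cache check and link extraction steps can be sketched as follows, using only the Python standard library. This is an illustrative sketch, not the disclosed modules: the class and function names are hypothetical, the score database is modelled as a plain dictionary keyed by source URL, and only the link descriptor (anchor text) is captured as associated information.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def source_links(source_url: str, html: str, score_db: Dict[str, float]):
    """Return (cached score, []) if the source was already scored,
    otherwise (None, extracted source links)."""
    if source_url in score_db:
        return score_db[source_url], []
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return None, parser.links
```

A caller would pass the extracted pairs on to whatever content analysis is used to attribute a sentiment score to each source link.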

In one embodiment, link sentiment can be composed of both a component of the source and the fundamental text around and included in the link to the content being analyzed. In one embodiment, the sentiment can be learned in an iterative semi-supervised or unsupervised approach post-seeding. In such an approach, a “dictionary” of known events or facts can be used to train the sentiment analysis scorer. As an example, consider the event “the Toronto Maple Leafs won the Stanley Cup in 1967” as a true event. The training process may include:

A. Create a dictionary of “trusted events” known to have occurred.
B. For each trusted event:
    B1. Determine known content that specifies this event.
    B2. Find sources that link to that content.
    B3. Extract the text from the source that corresponds to the link.
C. Build a corpus of “trusted texts” representing the links to trusted events.
D. Repeat B-C against controversial or untrue events to build a corpus of “untrusted texts” representing the links to untrusted events.
E. Build a text classifier or scorer based on similarity measures or other approaches to determine the link sentiment of unknown texts.
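Step E can be illustrated with a very small similarity-based scorer. The sketch below assumes the trusted and untrusted corpora from steps C and D are already available as lists of strings; the bag-of-words representation, the cosine similarity measure and the function names are choices made for this example, not details taken from the disclosure.

```python
import math
import re
from collections import Counter
from typing import Callable, Iterable


def bag_of_words(text: str) -> Counter:
    """Lower-cased word counts for a piece of link text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def build_scorer(
    trusted_texts: Iterable[str], untrusted_texts: Iterable[str]
) -> Callable[[str], float]:
    """Return a function mapping link text to a sentiment score in [0, 1].

    1.0 means the text resembles only the trusted corpus, 0.0 only the
    untrusted corpus, and 0.5 means no evidence either way.
    """
    trusted = bag_of_words(" ".join(trusted_texts))
    untrusted = bag_of_words(" ".join(untrusted_texts))

    def score(link_text: str) -> float:
        words = bag_of_words(link_text)
        pos = cosine(words, trusted)
        neg = cosine(words, untrusted)
        total = pos + neg
        return 0.5 if total == 0 else pos / total

    return score
```

A production scorer would likely use a larger corpus and a trained classifier; the point here is only the shape of the mapping from link text to a continuous sentiment value.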

For a given event, the link sentiment information is then used to build a sentiment-adjusted matrix via module 312, which includes the sentiments related to the link connections. The scoring calculation engine 314 uses the sentiment-adjusted matrix to iteratively compute the sentiment scores of each of the source links in the content item. The sentiment scores are then saved in the score database 304, which can be accessed for fast score returns during future content item evaluations as described above.

While the general method described above is thought to be novel, additional details of implementation will now be described, which enable certain method steps in a manner that would not be apparent to one skilled in the art. These details of implementation are considered non-obvious contributions to the art.

In particular embodiments, an algorithm is provided to score or rank the sources based on their link connections using equation (1):

r=α*T+(1−α)*d  (1)

where r is the score, d is a static score distribution vector with a given non-zero entry and T is the transition matrix. α represents a decay factor, which is a constant that adjusts for the reliability of information, as represented by the probability that the crawler will follow an outlink from a given page. The decay constant is usually in the range 0.8-0.9. This rank estimation is similar to the TrustRank algorithm, where linear dependencies on the number of in-links and out-links are considered. The rank is evaluated iteratively, assuring that convergence conditions are fulfilled. Generation of the transition matrix is generally known from the TrustRank approach and from other sources, and is not described in further detail herein.
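The iterative evaluation can be sketched in Python. This sketch assumes the TrustRank-style update, in which the transition matrix is applied to the previous estimate of r at each step (r ← α·T·r + (1−α)·d); equation (1) as printed does not show r on the right-hand side, so this reading is an assumption. It also assumes T[i][j] holds the probability of moving from source j to source i. The function name, tolerance and iteration cap are hypothetical.

```python
from typing import List, Sequence


def score_sources(
    T: Sequence[Sequence[float]],
    d: Sequence[float],
    alpha: float = 0.85,   # decay factor, within the 0.8-0.9 range in the text
    tol: float = 1e-10,    # hypothetical convergence tolerance
    max_iter: int = 1000,  # hypothetical safety cap on iterations
) -> List[float]:
    """Iterate r <- alpha * T r + (1 - alpha) * d until r stops changing."""
    n = len(d)
    r = list(d)
    for _ in range(max_iter):
        new_r = [
            alpha * sum(T[i][j] * r[j] for j in range(n)) + (1 - alpha) * d[i]
            for i in range(n)
        ]
        # Stop once the total change across all sources is negligible.
        if sum(abs(x - y) for x, y in zip(new_r, r)) < tol:
            return new_r
        r = new_r
    return r
```

With two sources that link only to each other, `T = [[0, 1], [1, 0]]`, a seed vector `d = [1, 0]` and `alpha = 0.5`, the scores settle at two thirds for the seeded source and one third for the other.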

Of particular pertinence to this preferred embodiment is that the sentiment score of each of the source links is used to adjust the values of the links in the transition matrix, giving different probabilities depending on the sentiment score for different links. At a given node in the transition matrix, the probabilities for the transitions (e.g. link clicks) must be evaluated depending on the number of nodes and sentiments. If the node has A positive links, B negative links and C sentiment-unknown links, then it is possible to calculate these probabilities from:

$A x_{+} + B x_{-} + C x_{nons} = 1$  (2)

where $x_{+}$, $x_{-}$ and $x_{nons}$ are the probabilities for a positive, a negative and a non-sentiment link, respectively. In general, it is assumed that a positive link will be clicked with higher probability than a non-sentiment link and that a negative link will have the smallest probability, for example:

$x_{-} = \frac{x_{+}}{n}, \qquad x_{nons} = \frac{x_{+}}{m}$  (3)

where m and n are parameters that can be varied with the only condition that n>m. The sum of the probabilities must be normalized to 1.
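Substituting (3) into (2) gives x₊ = 1 / (A + B/n + C/m), from which the other two probabilities follow. A minimal Python sketch of that calculation is shown below; the function name and the default values of m and n are hypothetical.

```python
from typing import Tuple


def link_probabilities(
    num_pos: int,
    num_neg: int,
    num_unknown: int,
    m: float = 2.0,  # hypothetical divisor for non-sentiment links
    n: float = 4.0,  # hypothetical divisor for negative links; must exceed m
) -> Tuple[float, float, float]:
    """Return (x_pos, x_neg, x_nons) satisfying equations (2) and (3)."""
    if not n > m:
        raise ValueError("the text requires n > m")
    weight = num_pos + num_neg / n + num_unknown / m
    if weight == 0:
        raise ValueError("node has no outgoing links")
    x_pos = 1.0 / weight
    return x_pos, x_pos / n, x_pos / m
```

For a node with one positive, two negative and one sentiment-unknown link, these defaults give 0.5, 0.125 and 0.25, and 1·0.5 + 2·0.125 + 1·0.25 = 1 as required by (2).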

For example, in an extreme case a positive sentiment transition matrix can be constructed by removing all source links with negative sentiment. Similarly, a negative sentiment transition matrix can be constructed by removing all links with positive sentiment. From the positive transition matrix one obtains the highest ranks for the most trusted sources. In the case of the negative matrix, the highest ranks will be obtained for the least trusted sources. These two ranks or scores can then be merged in a normalized way to obtain a final score. One simple approach is to divide every value by the maximum score in each case (positive and negative). The non-sentiment links can be treated together with the positive links, giving smaller probabilities for these events in the transition matrix.

It will be understood that all approaches described herein are implemented on computer readable media and executed by a computer system as described earlier.
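One possible way to merge the two rankings is sketched below. The normalization by the maximum score follows the text; the final step of subtracting the normalized negative rank from the normalized positive rank is an assumption made for this example, since the text does not state how the normalized values are combined. The function name is hypothetical.

```python
from typing import Dict


def merge_ranks(
    positive: Dict[str, float], negative: Dict[str, float]
) -> Dict[str, float]:
    """Normalize each ranking by its maximum, then combine per source.

    Assumption: final score = normalized positive rank minus normalized
    negative rank, so trusted sources score high and distrusted ones low.
    """
    max_pos = max(positive.values(), default=0.0)
    max_neg = max(negative.values(), default=0.0)
    merged = {}
    for source in positive.keys() | negative.keys():
        pos = positive.get(source, 0.0) / max_pos if max_pos else 0.0
        neg = negative.get(source, 0.0) / max_neg if max_neg else 0.0
        merged[source] = pos - neg
    return merged
```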

Assessing the Trustworthiness of Events

Turning now to FIG. 4, the event-based engine generally assumes that a source of a content item describing an event has a credibility score based on other known sources describing the same or a similar event. For example, a trusted source will share more information on an event with a known good (KG) source and less information with a known bad (KB) source. A content item is retrieved from the Internet 6. The request for the content item can be made manually by direct user request 404 or by an automated crawler 406.

The information is translated via a translation engine 408 into a form readable by event analyzer 410. Event analyzer 410 is a software-implemented module that carries out a classification based on the similarity to other events stored within a defined time window. The event analyzer 410 retrieves the translated content item from the translation engine and passes it through a natural language processing (NLP) algorithm 412 to identify the events present within the content item and so create a set of identified events. The NLP algorithms 412 are based on similarity measures plus keyword search and can have machine learning components known to those of skill in the art. In one approach, regular expressions (regex) or other pattern-based approaches are used to identify set events as represented by set patterns of text. For example, “Acme Co. was acquired by Bob's company” could be captured by a rule that looks for text containing “was acquired by”. More sophisticated examples, such as parser-based extraction, knowledge-based extraction, etc., are described by Hogenboom et al. (http://iswc2011.semanticweb.org/fileadmin/iswc/Papers/Workshops/DeRiVE/derive2011_submission_1.pdf). The identified events and associated information provider may be stored in an events database 414 for further use.
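The “was acquired by” rule can be sketched with Python's `re` module. The pattern, the function name and the shape of the returned record are illustrative assumptions, and the sketch expects one sentence at a time.

```python
import re
from typing import Dict, Optional

# Lazily capture the text before and after the fixed phrase; an optional
# sentence-ending punctuation mark is dropped from the acquirer.
ACQUISITION = re.compile(
    r"(?P<target>.+?)\s+was acquired by\s+(?P<acquirer>.+?)[.!?]?$"
)


def extract_acquisition(sentence: str) -> Optional[Dict[str, str]]:
    """Return an acquisition event record, or None if the rule does not fire."""
    match = ACQUISITION.search(sentence.strip())
    if match is None:
        return None
    return {
        "type": "acquisition",
        "target": match["target"],
        "acquirer": match["acquirer"],
    }
```

Applied to “Acme Co. was acquired by Bob's company.”, the rule yields a record with target “Acme Co.” and acquirer “Bob's company”.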

A calendar may provide temporal information such as time and date to the event analyzer 410 in order to reduce noise levels when comparing events. The event analyzer 410 then performs a time-limited query of the events database 414 to identify candidate events for comparison to the identified events from the content item. An analysis of the candidate events with respect to the identified events from the content item is carried out to evaluate the candidate events and to determine whether any of the candidate events are contradictory to each other.

The candidate events and associated sources as well as the identified events from the content item are then evaluated by an event scoring module 420. The event scoring module 420 optionally first performs a query of a scores database 422 for each of the candidate events in order to determine if the source of information is a Known Good (KG), a Known Bad (KB), or an unknown source in terms of credibility and trust. The event score (or rank) for the content item is then determined according to the distribution between KG and KB sources by the following formula:
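A time-limited query can be as simple as filtering stored events by their distance from a reference time. In the sketch below the events database is modelled as a list of dictionaries with a `time` field, and the two-day window is an arbitrary example value; both are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def candidate_events(
    events: List[Dict],
    reference_time: datetime,
    window: timedelta = timedelta(days=2),  # hypothetical window size
) -> List[Dict]:
    """Return stored events whose timestamp lies within the window
    around the reference time."""
    return [e for e in events if abs(e["time"] - reference_time) <= window]
```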

${EventScore} = \begin{cases} A & \text{if event contains a KG,} \\ \frac{A}{m} & \text{if event contains } b \text{ unknown and no KG nor KB} \\ \frac{A}{n} & \text{if event has less than } b \text{ unknowns and no KG nor KB} \\ \frac{A}{p} & \text{if event has no KG and includes a KB} \end{cases}$

where an event contained in at least one KG source will have an event score A for the corresponding news source. An event with no KG or KB included but with b unknown sources included is scored as A/m. In the case of an event with no KG or KB involved and a number of unknowns less than a set value b, the score will be A/n. An event with no KG and with a KB involved will result in the score A/p. Here A could be any natural number, for example A=10, and m, n and p are also parameters which can be adjusted for better score discrimination, provided that m<n<p and A/p<A/n<A/m<A. Particular values for each of these parameters can be optimized depending on the context in which the invention is used. Once a score has been generated for the content item, it can be stored so that future content requests drawing from the same source can make reference to a stored score rather than being processed again.
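The piecewise rule can be sketched, purely for illustration, as a small function. The default values m=2, n=4, p=8 and b=3 are arbitrary assumptions chosen to satisfy m<n<p; only A=10 is taken from the text. The sketch also assumes that “contains b unknown” means at least b unknown sources.

```python
from typing import Sequence


def event_score(labels: Sequence[str],
                A: float = 10.0, m: float = 2.0, n: float = 4.0,
                p: float = 8.0, b: int = 3) -> float:
    """Score an event from the labels ("KG", "KB" or "unknown") of the sources reporting it."""
    if not 1 < m < n < p:
        # Required so that A/p < A/n < A/m < A holds.
        raise ValueError("parameters must satisfy 1 < m < n < p")
    if "KG" in labels:
        return A          # at least one Known Good source reports the event
    if "KB" in labels:
        return A / p      # no KG, but a Known Bad source reports it
    unknowns = sum(1 for label in labels if label == "unknown")
    # Assumption: b or more unknown sources earn the higher of the two remaining scores.
    return A / m if unknowns >= b else A / n
```

For instance, an event reported by two unknown sources and one KG source receives the full score A, while the same event reported only by the two unknown sources receives A/n under these illustrative parameters.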

As noted above, the analysis may identify information that contradicts an event, or other data that throws the trustworthiness of an event into dispute. This contradictory information is referred to herein alternatively as an anti-event, when the content item contradicts another known source. These anti-events can be scored as A/A1 if the event contradicts a KB and A/p1 if the event contradicts a KG, where A1 and p1 are also parameters which are predetermined and selected depending on the context in which the invention is used. Generally, A1<≈1 and p1<≈p.
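A minimal sketch of the anti-event scoring follows. The defaults A1=1 and p1=8 are illustrative assumptions consistent with A1<≈1 and p1<≈p for a hypothetical p=8; the function name is likewise hypothetical.

```python
def anti_event_score(contradicted: str,
                     A: float = 10.0, A1: float = 1.0, p1: float = 8.0) -> float:
    """Score an anti-event from the label ("KG" or "KB") of the source it contradicts."""
    if A1 <= 0 or p1 <= 0:
        raise ValueError("A1 and p1 must be positive")
    if contradicted == "KB":
        return A / A1   # contradicting a Known Bad source counts in the event's favour
    if contradicted == "KG":
        return A / p1   # contradicting a Known Good source counts against it
    raise ValueError("contradicted source must be labeled 'KG' or 'KB'")
```

With these values, contradicting a KB source yields a score close to A, whereas contradicting a KG source yields a score close to that of an event reported by a KB source.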

Preferably, a suitable corpus from a known set of seed sources is present in the events database 414. This seed set of sources comprises KG and KB sources that can be progressively adjusted and enriched when more content items (and their sources) are analyzed. The seed set can be manually generated or could rely on other approaches based on spam or trust detection. For example, the New York Times™ may receive a high trust score from an anti-spam algorithm and thus would fall into the KG classification in the database. In one embodiment, an input seed source can initially include a list of relevant sources and a result of the highest ranked sources of the link-based algorithm of the present embodiment.

An information provider or source can gain a KG status if a sufficient number of its content items and events are consistently shared with other previously KG sources within the database. Similarly, an information provider can gain a KB status if a sufficient number of its content items and events are consistently shared with other previously KB sources. Sources with content items and events shared with both KG and KB sources, or only with untrusted sources, will continue to be labeled as untrusted sources.
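One possible reading of this promotion rule is sketched below. The threshold `min_shared` and the interpretation of “consistently” as meaning no overlap with the opposite class are assumptions; the text does not fix either.

```python
def updated_status(shared_with_kg: int, shared_with_kb: int,
                   min_shared: int = 5) -> str:
    """Return "KG", "KB" or "untrusted" from counts of events shared with KG and KB sources."""
    if shared_with_kg >= min_shared and shared_with_kb == 0:
        return "KG"         # enough events shared exclusively with Known Good sources
    if shared_with_kb >= min_shared and shared_with_kg == 0:
        return "KB"         # enough events shared exclusively with Known Bad sources
    return "untrusted"      # mixed overlap, too little overlap, or untrusted overlap only
```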

Alternatively, content items and events with no corresponding KG or KB sources can be scored according to the number of sources agreeing or disagreeing, with the risk of adding additional noise to the scoring process. If the noise is too great, these content items can be omitted when ranking. Another approach could be to use the previous link-based score, which already gives an independent score characterizing the sources, to gain some insight into the distribution and apply a corresponding score. For example, imagine the case of 5 news sources we want to characterize, where 4 of them are un-trusted sources (without KG or KB included) and one source contradicts the other 4. If we know that the sources' distribution is dominated, or is most likely dominated, by un-trusted sources, we can treat the source that contradicts the others more often as carrying a signature of trust, and the opposite in the case of a distribution dominated by trusted sources.

Cumulative Scoring

The event ranking module combines the EventScore, the LinkScore associated with the sources within the content item, and a third term related to links and sentiment attached to the event itself, as further described below.

${EventRank} = a \cdot {EventScore} + \frac{b}{t}\sum_{i=1}^{t}{LinkScore}(i) + c \cdot {LinksToEvent}$

where the a, b and c coefficients are weights, the second term takes into account the LinkScores of the sources which reproduced the event (t is the number of sources), and LinksToEvent will map all links. In this case we want to differentiate links directing to a webpage from links directing to a given event. From our adjacency matrix we know the links directing to a webpage, so we can select the ones directing only to the event we are analyzing and sum over their LinkScores (for each source with a link directing to the event), together with the sentiment analysis attached to the link. This factor can be estimated as:

${LinksToEvent} = \sum_{i=1}^{n}{LinkScore}(i) \times {Sent}$

where n is the number of links directing to the content that contains the given event, LinkScore is the score of the source of the link directing to the event, and Sent is a factor which considers the sentiment attached to the link.
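The two expressions can be combined in a short sketch. It assumes that the sentiment factor Sent is supplied per link, which the formula leaves implicit; the function names and the default weights a=b=c=1 are hypothetical.

```python
from typing import Sequence, Tuple


def links_to_event(links: Sequence[Tuple[float, float]]) -> float:
    """Sum LinkScore x Sent over (link_score, sentiment) pairs for links pointing at the event."""
    return sum(score * sent for score, sent in links)


def event_rank(event_score: float,
               source_link_scores: Sequence[float],
               links: Sequence[Tuple[float, float]],
               a: float = 1.0, b: float = 1.0, c: float = 1.0) -> float:
    """EventRank = a*EventScore + (b/t)*sum of LinkScores + c*LinksToEvent."""
    t = len(source_link_scores)
    # Mean LinkScore of the t sources that reproduced the event (0 if there are none).
    mean_link_score = sum(source_link_scores) / t if t else 0.0
    return a * event_score + b * mean_link_score + c * links_to_event(links)
```

A link with negative sentiment lowers LinksToEvent, so an event that is widely linked but consistently disputed can rank below one with fewer, approving links.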

The integrated AccumulatedEventRank will be evaluated in both cases as:

${AccumulatedEventRank} = {\frac{1}{N_{i}}{\sum_{j = 1}^{N_{i}}{{EventRank}(j)}}}$

where the individual scores are added for an information provider and a normalization is done to the total number of events N_(i) of the respective information provider.

In practice, the content of a source could be rather unique. For example, if one compares what is published in a local newspaper with what is published in a national or international newspaper, there will likely be very little overlap in events, simply because they cover different subjects or geographies or have different interests. This does not mean a source is un-trusted, and these cases will be complemented by the LinkScore algorithm.

A final source score will be obtained after normalization of both LinkScore and AccumulatedEventRank. The final source score can be calculated, in a first approach, as the weighted average value of both scores, which we call SourceRank:

SourceRank=weightLink*LinkScore+weightEvent*AccumulatedEventRank

where weightLink and weightEvent are weighting factors which are estimated using a test database with result cases, and can be optimized by one skilled in the art based on the data set being used.
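The accumulation and final weighting can be sketched as follows. The text does not specify the normalization, so min-max scaling across sources is assumed here; the equal default weights and all function names are likewise illustrative.

```python
from typing import Dict, List, Sequence


def accumulated_event_rank(event_ranks: Sequence[float]) -> float:
    """Average the EventRank values of one information provider over its N_i events."""
    if not event_ranks:
        raise ValueError("provider has no scored events")
    return sum(event_ranks) / len(event_ranks)


def min_max(values: Sequence[float]) -> List[float]:
    """Assumed normalization: rescale values to [0, 1]; all zeros if they are identical."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


def source_ranks(link_scores: Dict[str, float],
                 event_ranks: Dict[str, Sequence[float]],
                 weight_link: float = 0.5,
                 weight_event: float = 0.5) -> Dict[str, float]:
    """SourceRank = weightLink*LinkScore + weightEvent*AccumulatedEventRank, after normalization."""
    names = sorted(link_scores)
    links = min_max([link_scores[name] for name in names])
    events = min_max([accumulated_event_rank(event_ranks[name]) for name in names])
    return {name: weight_link * l + weight_event * e
            for name, l, e in zip(names, links, events)}
```

Because both inputs are rescaled to a common range before weighting, a source with little event overlap (such as a local newspaper) can still obtain a reasonable SourceRank through its LinkScore.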

Note that both the event scoring and source scoring approaches allow for detection and handling of badly extracted data. An important consideration is that event detection, even with state-of-the-art systems, can be error-prone. In such cases, the extracted information may not be accurate compared to the source. One advantage to the invention, as described, is that an incorrectly extracted event can be discriminated against via a low event trust score, removing noise from the event extraction process. If a particular source is extracted in a particularly noisy fashion, perhaps because of the way the source is structured, then this is reflected in the SourceRank.

Example

As an example, Fig. shows 8 sources with hypothetical link connections and a discrete 3-level sentiment score associated with each link. The original transition matrix would look like:

$T = \begin{pmatrix}0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\0 & {1\text{/}2} & 0 & 0 & 0 & 1 & 0 & 0 \\0 & {1\text{/}2} & 0 & 0 & 0 & 0 & 0 & 0 \\0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\0 & 0 & 0 & 0 & {1\text{/}3} & 0 & 0 & 1 \\0 & 0 & 0 & 0 & {1\text{/}3} & 0 & 0 & 0 \\0 & 0 & 0 & 0 & {1\text{/}3} & 0 & 0 & 0\end{pmatrix}$

If we consider the sentiments attached to the links, we can generate a new transition matrix, for example:

$T = \begin{pmatrix}0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\1 & 0 & 1 & 0 & 0 & 0 & 0 & 0 \\0 & {1\text{/}3} & 0 & 0 & 0 & 1 & 0 & 0 \\0 & {2\text{/}3} & 0 & 0 & 0 & 0 & 0 & 0 \\0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 \\0 & 0 & 0 & 0 & 0.6 & 0 & 0 & 1 \\0 & 0 & 0 & 0 & 0.3 & 0 & 0 & 0 \\0 & 0 & 0 & 0 & 0.1 & 0 & 0 & 0\end{pmatrix}$

where we selected some probability values for the links according to the sentiments attached. The real probability distribution might be different from what we showed in this example. We can create a positive and a negative transition matrix in the same way, by taking only positive and negative sentiment links.
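One way such probability values might be derived is to give each link a sentiment weight and normalize the weights of each source's outgoing links so that they sum to one. This normalization is an assumption made for illustration; the example above only states that the values were selected according to the sentiments. Indices are 0-based, and the weights shown are hypothetical.

```python
from typing import Dict, List, Tuple


def sentiment_transition(weights: Dict[Tuple[int, int], float],
                         size: int) -> List[List[float]]:
    """Build a transition matrix from weights[(target, source)] by normalizing each source column."""
    totals = [0.0] * size
    for (_, source), weight in weights.items():
        totals[source] += weight
    matrix = [[0.0] * size for _ in range(size)]
    for (target, source), weight in weights.items():
        if totals[source] > 0:
            matrix[target][source] = weight / totals[source]
    return matrix


# Hypothetical weights reproducing two columns of the example: source 2 links to
# sources 3 and 4; source 5 links to sources 6, 7 and 8.
example_weights = {(2, 1): 1.0, (3, 1): 2.0, (5, 4): 6.0, (6, 4): 3.0, (7, 4): 1.0}
example_matrix = sentiment_transition(example_weights, 8)
```

Restricting `weights` to links with positive (or negative) sentiment before normalizing would give the positive (or negative) transition matrix mentioned above.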

In the Fig we specified that sources 2 and 4 are trusted ones, so we can set the vector d as:

d=[0,1/2,0,1/2,0,0,0,0]
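As a hedged illustration, the sentiment-weighted matrix T and the vector d of this example can be fed to a biased PageRank-style iteration. The sketch assumes that the link-based score satisfies r = α·T·r + (1−α)·d and that α = 0.85; both are assumptions, as are the function names.

```python
from typing import List

ALPHA = 0.85  # assumed value of the predetermined constant

# Sentiment-weighted transition matrix from the example (entry [i][j]: source j+1 -> source i+1).
T = [
    [0, 0,     0, 0, 0,   0, 0, 0],
    [1, 0,     1, 0, 0,   0, 0, 0],
    [0, 1 / 3, 0, 0, 0,   1, 0, 0],
    [0, 2 / 3, 0, 0, 0,   0, 0, 0],
    [0, 0,     0, 1, 0,   0, 0, 0],
    [0, 0,     0, 0, 0.6, 0, 0, 1],
    [0, 0,     0, 0, 0.3, 0, 0, 0],
    [0, 0,     0, 0, 0.1, 0, 0, 0],
]
d = [0.0, 0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0]  # sources 2 and 4 are trusted


def step(T: List[List[float]], r: List[float], d: List[float], alpha: float) -> List[float]:
    """One update r <- alpha*T*r + (1 - alpha)*d."""
    size = len(r)
    return [alpha * sum(T[i][j] * r[j] for j in range(size)) + (1 - alpha) * d[i]
            for i in range(size)]


def link_scores(T: List[List[float]], d: List[float], alpha: float = ALPHA,
                tol: float = 1e-12, max_iter: int = 1000) -> List[float]:
    """Iterate from r = d until successive vectors differ by less than tol."""
    r = list(d)
    for _ in range(max_iter):
        nxt = step(T, r, d, alpha)
        if max(abs(x - y) for x, y in zip(nxt, r)) < tol:
            return nxt
        r = nxt
    return r


scores = link_scores(T, d)
```

Source 1 receives no incoming links and is not in the trusted set, so its score stays at zero, while trust injected at sources 2 and 4 propagates along the weighted links to the remaining sources.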

In the last example we only took into account link connections between sources as a whole; for the EventRank we also need to include links directing to a given event.

As an example, in FIG. 5, consider seven sources (labeled S1-S7): 4 unknown sources which are to be evaluated for the EventRank value, and 2 KG sources and 1 KB source which were previously analyzed and are now used as seeds. Suppose an event (news) appears in the unknown sources 4 and 5 and in the KG source 2. The EventScore is evaluated and, in parallel, the link-based analysis is conducted to determine the LinkScores associated with the event.

If we continue running many events, then we accumulate data for the AccumulatedEventRank and the final SourceRank. The best and worst values for the SourceRank provide feedback into the seed set of KG and KB sources, gradually improving the EventRank estimation. Convergence conditions need to be set so that optimum EventRank and SourceRank results are obtained.

Neural networks, cluster models, hidden Markov models, Bayesian networks, or other machine learning methods can also be used to classify or create clusters for further analysis, potentially optimizing the best-fitting algorithms, performing the calculations on a subset of documents or acting as a replacement or first-pass against large sets of documents. Alternatively, creating decision trees or other path optimization approaches can be used.

The above-described embodiments may be useful in a number of contexts where the integrity of an event or fact may be critical to ascertain. Several examples of use are now described. However, these examples are not meant to be comprehensive. One example is for use in scoring and verifying information for an applicant for insurance underwriting. In such an example, it is important that an applicant is not, for example, denied insurance based on incorrect information.

In another example, the methods described may be useful in assessing possible fraudulent activity. Automated monitoring systems may generate many alerts based on detected “events” that may not be verified. The methods, as described, can be used to score events to determine their validity. Alternatively, abnormal or unexpected events or facts could be flagged for further scrutiny.

In another example, the methods described may be employed to help with credit decisioning, either by an automated system or to support the decision of a loan officer. In such a scenario, assessing the truthfulness or validity of detected information can be an important part of determining what information impacts the credit decision.

In another example, the methods described may be used for securities trading, either as support for a human trader or as part of an automated system. Automated systems that trade on news or events detected are already used by traders. Adding the ability to measure the trustworthiness of detected events could be an important advantage for these systems, for example, by preventing trading decisions based on false or poor information.

The above-described embodiments are intended to be examples of the present invention and alterations and modifications may be effected thereto, by those of skill in the art, without departing from the scope of the invention, which is defined solely by the claims appended hereto.

What is claimed is:
 1. A computer-implemented method for ranking information stored on a computer readable medium; the method comprising: identifying an event from each source in a set of information sources; calculating an event score for a content item describing said event based on a credibility score for each of the known sources in said set of information sources; and combining a score for a respective information source with said credibility score to determine a cumulative event score.
 2. The method according to claim 1, wherein each of said sources in said set of information sources is classified as one of a known good source, a known bad source and an unknown reliability source, and wherein said calculating an event score is biased towards sources identified as known good sources.
 3. The method according to claim 2, wherein said event score is calculated as: ${EventScore} = \begin{cases} A & \text{if event contains a KG,} \\ \frac{A}{m} & \text{if event contains } b \text{ unknown and no KG nor KB} \\ \frac{A}{n} & \text{if event has less than } b \text{ unknowns and no KG nor KB} \\ \frac{A}{p} & \text{if event has no KG and includes a KB} \end{cases}$ where A, m, n and p are parameters selected such that A/p<A/n<A/m<A; KG is a known good source; KB is a known bad source.
 4. The method according to claim 1, wherein said identified events are compared to identify contradictions, and said calculating an event score includes biasing events from known good sources to resolve said contradictions.
 5. The method according to claim 1, wherein said combining comprises calculating an event ranking representative of said event being reliable.
 6. The method according to claim 5, wherein calculating said event ranking is calculated as: EventRank=a*EventScore+b*τLinkScore+c*LinksToEvent where a, b and c are weighted coefficients and LinksToEvent is calculated as: ${LinksToEvent} = \sum_{i=1}^{n}{LinkScore}(i) \times {Sent}$ where n is the number of sources and LinkScore is the score of said information source, and Sent is the sentiment score.
 7. The method according to claim 6, further comprising calculating an accumulated event rank for the information provider from a plurality of event ranks by: ${AccumulatedEventRank} = {\frac{1}{N_{i}}{\sum_{j = 1}^{N_{i}}{{EventRank}(j)}}}$ where EventRank(j) is a plurality of event scores for a plurality of content items and N_(i) is a total number of content items of the information provider in the known source database.
 8. The method according to claim 6, whereby the sentiment score is derived using a sentiment scorer that was created at least in part using: a training set of known true and known false events; and known links and the associated text of said known links to content that specifies the known true and known false events.
 9. The method according to claim 8, wherein said score for a respective information source is determined by: extracting a content item describing an event from an information source; parsing by a parsing module the content item for a plurality of source links; attributing by a content analysis module a sentiment score to each source link; wherein said sentiment score is indicative of the relative credibility of each of said source links; scoring by a scoring module said information source based on said source links and on said sentiment score; and ranking said content item based on a score associated with said information source.
 10. The method according to claim 9, wherein said scoring comprises calculating r from equation (1): r=α*T+(1−α)*d where d is a non-zero static score distribution vector, T is a transition matrix, and α is a predetermined constant; and, wherein each term in said transition matrix is modified by a non-zero sentiment score.
 11. The method of claim 9, further comprising storing on a score database implemented on a computer readable medium said score for said information source.
 12. The method according to claim 11, further comprising prior to said parsing step determining whether said information source has an associated score in said score database, and upon determining that said information source has an associated score in said score database, retrieving said score and returning to said extracting step.
 13. The method of claim 1, wherein the ranking score is used for any one or more of insurance underwriting, assessing suspected fraudulent activity, credit decisioning, and securities trading.